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Abstract 

In  this  paper,  we  show  that  we  can  ob¬ 
tain  a  good  baseline  performance  for 
Question  Answering  (QA)  by  using 
only  4  simple  features.  Using  these  fea¬ 
tures,  we  contrast  two  approaches  used 
for  a  Maximum  Entropy  based  QA  sys¬ 
tem.  We  view  the  QA  problem  as  a 
classification  problem  and  as  a  re¬ 
ranking  problem.  Our  results  indicate 
that  the  QA  system  viewed  as  a  re¬ 
ranker  clearly  outperforms  the  QA  sys¬ 
tem  used  as  a  classifier.  Both  systems 
are  trained  using  the  same  data. 

1  Introduction 

Open-Domain  factoid  Question  Answering  (QA) 
is  defined  as  the  task  of  answering  fact-based 
questions  phrased  in  Natural  Language.  Exam¬ 
ples  of  some  question  and  answers  that  fall  in 
the  fact-based  category  are: 


1.  What  is  the  capital  of  Japan?  -  Tokyo 

2.  What  is  acetaminophen?  -  Non-aspirin  pain  killer 

3.  Where  is  the  Eiffel  Tower?  -  Paris 


The  architecture  of  most  of  QA  systems  consists 
of  two  basic  modules:  the  information  retrieval 
(IR)  module  and  the  answer  pinpointing  module. 
These  two  modules  are  used  in  a  typical  pipeline 
architecture. 

Eor  a  given  question,  the  IR  module  finds  a 
set  of  relevant  segments.  Each  segment  typically 
consists  of  at  most  R  sentences'.  The  answer 
pinpointing  module  processes  each  of  these  seg¬ 
ments  and  finds  the  appropriate  answer  phrase. 

*  In  our  experiments  we  use  R=1 


phrase.  Evaluation  of  a  QA  system  is  judged  on 
the  basis  on  the  final  output  answer  and  the  cor¬ 
responding  evidence  provided  by  the  segment. 

This  paper  focuses  on  the  answer  pinpointing 
module.  Typical  QA  systems  perform  re -ranking 
of  candidate  answers  as  an  important  step  in 
pinpointing.  The  goal  is  to  rank  the  most  likely 
answer  first  by  using  either  symbolic  or  statisti¬ 
cal  methods.  Some  QA  systems  make  use  of 
statistical  answer  pinpointing  (Xu  et.  al,  2002; 
Ittycheriah,  2001;  Ittycheriah  and  Salim,  2002) 
by  treating  it  as  a  classification  problem.  In  this 
paper,  we  cast  the  pinpointing  problem  in  a  sta¬ 
tistical  framework  and  compare  two  approaches, 
classification  and  re -ranking. 

2  Statistical  Answer  Pinpointing 
2.1  Answer  Modeling 

The  answer-pinpointing  module  gets  as  input  a 
question  q  and  a  set  of  possible  answer  candi¬ 
dates  {aia2—a^}  .  It  outputs  one  of  the  answer 

ae{a^a2—a^}  from  the  candidate  answer  set. 
We  consider  two  ways  of  modeling  this  prob¬ 
lem. 

One  approach  is  the  traditional  classification 
view  (Ittycheriah,  2001)  where  we  present  each 
Question-Answer  pair  to  the  classifier  which 
classifies  it  as  either  correct  answer  (true)  or  in¬ 
correct  answer  (false),  based  on  some  evidence 
(features). 

In  this  case,  we  model  P(cla,{aia2..%},?)  . 
Here,  c  =  {true,  false}  signifies  the  correctness 
of  the  answer  a  with  respect  to  the  question  q. 

The  probability  P(c\a,{a^a2..ajCs,q)  for  each  QA 
pair  is  modeled  independently  of  other  such 
pairs.  Thus,  for  the  same  question,  many  QA 
pairs  are  presented  to  the  classifier  as  independ- 
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ent  events  (histories).  If  the  training  corpus  con¬ 
tains  Q  questions  with  A  answers  for  each  ques¬ 
tion,  the  total  number  of  events  (histories)  would 
he  equal  to  Q  A  with  two  classes  (futures)  (cor¬ 
rect  or  incorrect  answer)  for  each  event.  Once 
the  prohahilities  P(cla,{aja2-%},<?)  have  been 
computed,  the  system  has  to  return  the  best  an¬ 
swer.  The  following  decision  rule  is  used: 

A 

a  =  arg max[P(frMe  I  a,  {uj  }’  ?)] 

a 

Another  way  of  viewing  the  problem  is  as  a 
re -ranking  task.  This  is  possible  because  the  QA 
task  requires  the  identification  of  only  one  cor¬ 
rect  answer,  instead  of  identifying  all  the  correct 
answer  in  the  collection.  In  this  case,  we  model 
P(a  I  {a^a2...a^},q) .  If  the  training  corpus  contains 
Q  questions  with  A  answers  for  each  question, 
the  total  number  of  events  (histories)  would  be 
equal  to  Q,  with  A  classes  (futures).  This  view 
requires  the  following  decision-rule  to  identify 
the  answer  that  seems  most  promising: 

A 

a  =  argmax[P(a  I  {aj  }’  ?)] 


In  summary. 


Classifier 

Re -ranker 

#Events  (Histo¬ 
ries) 

QA 

Q 

#Classes  (Eutures) 
per  event 

2 

A 

where, 

Q  =  total  number  of  questions. 

A  =  total  number  of  answer  chunks  considered 
for  each  question. 

2.2  Maximum  Entropy  formulation 

We  use  Maximum  Entropy  to  model  the  given 
problem  both  as  a  classifier  and  a  re -ranker.  We 
define  M  feature  functions, 

f^(a,{aia2...a^},q),m  =  l,....,M  ,  that  may  be  useful 
in  characterizing  the  task.  Della  Pietra  et.  al 
(1995)  contains  good  description  of  Maximum 
Entropy  models. 

We  model  the  classifier  as  follows: 

M 

exp  cfmMaxa2.Mj^},q)\ 

P{c\a,{ai<h-‘^A)S  = - - 

exp 

m=l 


where, 

=  =  [true,  false}  are  the  model 

parameters. 

The  decision  rule  for  choosing  the  best  an¬ 
swer  is: 

A 

a  =  arg  max[P{true  I  a,  {Aj  A2...a^  },  qf] 

a 

M 

=  arg  max[^  ^m,truefm  (a,  {fli  Az  -“a  }>  ?)] 

“  m=l 

The  above  decision  rule  requires  comparison  of 
different  probabilities  of  the 

form  P(frMel  a,{aj  a2...a2j},^)  .  However,  these 
probabilities  are  modeled  as  independent  events 
(histories)  in  the  classifier  and  hence  the  training 
criterion  does  not  make  them  directly  compara¬ 
ble. 

Eor  the  re-ranker,  we  model  the  probability 
as: 

M 

exp[^A„fJa,{a^a2...a^],q)] 

Pi(l  I  {Aj  A2...A^},^)= - - 

m=l 

where, 

X„\m  =  1,....,  M  are  the  model  parameters. 

Note  that  for  the  classifier  the  model  parameters 
are  ,  whereas  for  the  re-ranker  they  are  . 

This  is  because  for  the  classifier,  each  feature 
function  has  different  weights  associated  with 
each  class  (future).  Hence,  the  classifier  has 
twice  the  model  parameters  as  compared  to  the 
re -ranker. 

The  decision  rule  for  the  re-ranker  is  given  by: 

A 

A  =  arg  max[P(A  I  {Aj  a2...a^ },  i?)] 

a 

M 

=  arg  max[^  X„f„  (a,  {  Aj  A2  ...a^  },  ^?)] 

“  m=l 

The  re -ranker  makes  the  probabilities 
P(al{a^a2--a^},q) ,  considered  for  the  decision 
rule,  directly  comparable  against  each  other,  by 
incorporating  them  into  the  training  criterion 
itself.  Table  1  summarizes  the  differences  of  the 
two  models. 


Table  1  ;  Model  comparison  between  a  Classifier  and  Re-ranker 


Classifier 

Re-Ranker 

Mode 

ling 

Equa¬ 

tion 

M 

exp  (a,  {fli  flj . .  } ,  q')] 

M 

exp[^  («>  {«!  ■  -"a  K  ?)] 

P{a\{a^a2..M2,],q)=  " 

m=l 

m=\ 

Deci¬ 

sion 

Rule 

A 

a  =  arg  max  {/’(true  \  a,[a^a2...a  q)] 

a 

M 

=  arg  max[^  («>  {«1  «2  -"a  1’  4)] 

“  m=l 

A 

a  =  arg  max[P(fl  \[a^a2...aj^],qy\ 

a 

M 

=  arg  max[^  («,{«;  02  },  ^)] 

“  m=l 

2.3  Feature  Functions 

Using  above  formulation  to  model  the  probabil¬ 
ity  distribution  we  need  to  come  up  with  features 
fj.  We  use  only  four  basic  feature  functions  for 
our  system. 

1.  Frequency:  It  has  been  observed  that  the 
correct  answer  has  a  higher  frequency 
(Magnini  et  al.;  2002)  in  the  collection  of 
answer  chunks  (C).  Hence  we  count  the 
number  of  time  a  potential  answer  occurs  in 
the  IR  output  and  use  its  logarithm  as  a  fea¬ 
ture.  This  is  a  positive  continuous  valued 
feature. 

2.  Expected  Answer  Class:  Most  of  the  current 
QA  systems  employ  some  type  of  Answer 
Class  Identification  module.  Thus  questions 
like  “When  did  Bill  Clinton  go  to  college?”, 
would  be  identified  as  a  question  asking 
about  a  time  (or  a  time  period),  “Where  is 
the  sea  of  tranquility?”  would  be  identified 
as  a  question  asking  for  a  location.  If  the  an¬ 
swer  class  matches  the  expected  answer 
class  (derived  from  the  question  by  the  an¬ 
swer  identification  module)  this  feature  fires 
(i.e.,  it  has  a  value  of  1).  Details  of  this  mod¬ 
ule  are  explained  in  Hovy  et  al.  (2002).  This 
is  a  binary-valued  feature. 

3.  Question  Word  Absent:  Usually  a  correct 
answer  sentence  contains  a  few  of  the  ques¬ 
tion  words.  This  feature  fires  if  the  candidate 
answer  does  not  contain  any  of  the  question 
words.  This  is  also  a  binary  valued  feature. 


4.  Word  Match:  It  is  the  sum  of  ITF^  values  for 
the  words  in  the  question  that  matches  iden¬ 
tically  with  the  words  in  the  answer  sen¬ 
tence.  This  is  a  positive  continuous  valued 
feature. 

2.4  Training 

We  train  our  Maximum  Entropy  model  using 
Generalized  Iterative  scaling  (Darroch  and 
Ratcliff,  1972)  approach  by  using  YASMET^ . 

3  Evaluation  Metric 

The  performance  of  the  QA  system  is  highly 
dependent  on  the  performance  of  the  two  indi¬ 
vidual  modules  IR  and  answer-pinpointing.  The 
system  would  have  excellent  performance  if 
both  have  good  accuracy.  Hence,  we  need  a 
good  evaluation  metric  to  evaluate  each  of  these 
components  individually.  One  standard  metric 
for  IR  is  recall  and  precision.  We  can  modify 
this  metric  for  QA  as  follows: 


^  ITF  =  Inverse  Term  Frequency.  We  take  a  large  inde¬ 
pendent  corpus  &  estimate  ITF(W)  =l/(count(W)),  where 
W  =  Word. 

^  YASMET.  (Yet  Another  Small  Maximum  Entropy 
Toolkit)  http://www-i6.informatik.rwth-aachen.de/ 
Colleagues/och/software/Y  ASMET.html 


Question : 

1395  Who  is  Tom  Cruise  married  to 

9 

IR 

.  Output : 

1 

Tom  Cruise  is  married  to  actress 

Nicole 

Kidman  and  they  have  two  adopted  children  . 

2 

Tom  Cruise  is  married  to  Nicole 

Kidman  . 

Output  of  Chunker:  (The  number  to 
which  that  particular  chunk  came) 

the  left 

of  each  chunk  records  the  IR  sentence  from 

1 

Tom  Cruise 

1 

Tom 

1 

Cruise 

1 

is  married 

1 

married 

1 

actress  Nicole  Kidman  and  they 

1 

actress  Nicole  Kidman 

1 

actress 

1 

Nicole  Kidman 

1 

Nicole 

1 

Kidman 

1 

they 

1 

two  adopted  children 

1 

two 

1 

adopted 

1 

children 

2 

Tom  Cruise 

2 

Tom 

2 

Cruise 

2 

is  married 

2 

married 

2 

Nicole  Kidman 

2 

Nicole 

2 

Kidman 

Figure  1  :  Candidate  answer  extraction  for  a  question. 


^  „  #  relevant  answer  segment  returned 

Recall  = - 

#  Total  relevant  answer  segments 

^  .  .  #  relevant  answer  segments  returned 

Precision  = - 

#  Total  segments  returned 

It  is  almost  impossible  to  measure  recall  be¬ 
cause  tbe  IR  collection  is  typically  large  and  in¬ 
volves  several  hundreds  of  thousands  of 
documents.  Hence,  we  evaluate  our  IR  by  only 
the  precision  measure  at  top  N  segments.  This 
method  is  actually  a  rather  sloppy  approximation 
to  the  original  recall  and  precision  measure. 
Questions  with  fewer  correct  answers  in  the  col¬ 
lection  would  have  a  lower  precision  score  as 
compared  to  questions  with  many  answers. 
Similarly,  it  is  unclear  how  one  would  evaluate 
answer  questions  with  No  Answer  (NIL)  in  the 
collection  using  this  metric.  All  these  questions 
would  have  zero  precision  from  the  IR  collec¬ 
tion. 


The  answer-pinpointing  module  is  evaluated 
by  checking  if  the  answer  returned  by  the  system 
as  the  top  ranked  (#1)  answer  is  correct/incorrect 
with  respect  to  the  IR  collection  and  the  true 
answer.  Hence,  if  the  IR  system  fails  to  return 
even  a  single  sentence  that  contains  the  correct 
answer  for  the  given  question,  we  do  not  penal¬ 
ize  the  answer-pinpointing  module.  It  is  again 
unclear  how  to  evaluate  questions  with  No  an¬ 
swer  (NIL).  (Here,  for  our  experiments  we  at¬ 
tribute  the  error  to  the  IR  module.) 

Finally,  the  combined  system  is  evaluated  by 
using  the  standard  technique,  wherein  the  An¬ 
swer  (ranked  #1)  returned  by  the  system  is 
judged  to  be  either  correct  or  incorrect  and  then 
the  average  is  taken. 


4  Experiments 

4.1  Framework 
Information  Retrieval 

For  our  experiments,  we  use  the  Web  search 
engine  AltaVista.  For  every  question,  we  re¬ 
move  stop-words  and  present  all  other  question 
words  as  query  to  the  Web  search  engine.  The 
top  relevant  documents  are  downloaded.  We 
apply  a  sentence  segmentor,  and  identify  those 
sentences  that  have  high  ITF  overlapping  words 
with  the  given  question.  The  sentences  are  then 
re -ranked  accordingly  and  only  the  top  K  sen¬ 
tences  (segments)  are  presented  as  output  of  the 
IR  system. 

Candidate  Answer  Extraction 

For  a  given  question,  the  IR  returns  top  K 
segments.  For  our  experiments  a  segment  con¬ 
sists  of  one  sentence.  We  parse  each  of  the  sen¬ 
tences  and  obtain  a  set  of  chunks,  where  each 
chunk  is  a  node  of  the  parse  tree.  Each  chunk  is 
viewed  as  a  potential  answer.  For  our  experi¬ 
ments  we  restrict  the  number  of  potential  an¬ 
swers  to  be  at  most  5000.  We  illustrate  this 
process  in  Figure  1 . 

Training/Test  Data 

Table  2  :  Training  size  and  sources. 


Training  H- 
Validation 

Test 

Question  collec¬ 
tion 

TREC  9  H- 
TREC  10 

TREC 11 

Total  questions 

1192 

500 

We  use  the  TREC  9  and  TREC  10  data  sets 
for  training  and  the  TREC  1 1  data  set  for  testing. 
We  initially  apply  the  IR  step  as  described  above 
and  obtain  a  set  of  at  most  5000  answers.  Eor 
each  such  answer  we  use  the  pattern  file  sup¬ 
plied  by  NIST  to  tag  answer  chunks  as  either 
correct  (1)  or  incorrect  (0).  This  is  a  very  noisy 
way  of  tagging  data.  In  some  cases,  even  though 


the  answer  chunk  may  be  tagged  as  correct  it 
may  not  be  supported  by  the  accompanying  sen¬ 
tence,  while  in  other  cases,  a  correct  chunk  may 
be  graded  as  incorrect,  since  the  pattern  file  list 
did  not  represent  a  exhaustive  list  of  answers. 
We  set  aside  20%  of  the  training  data  for  valida¬ 
tion. 

4.2  Classifier  vs.  Re-Ranker 

We  evaluate  the  performance  of  the  QA  system 
viewed  as  a  classifier  (with  a  post-processing 
step)  and  as  a  re-ranker.  In  order  to  do  a  fair 
evaluation  of  the  system  we  test  the  performance 
of  the  QA  system  under  varying  conditions  of 
the  output  of  the  IR  system.  The  results  are 
shown  in  Table  3. 

The  results  should  be  read  in  the  following 
way:  We  use  the  same  IR  system.  However,  dur¬ 
ing  each  run  of  the  experiment  we  consider  only 
the  top  K  sentences  returned  by  the  IR  system 
K={  1,10,50,100,150,200}.  The  column  “cor¬ 
rect”  represents  the  number  of  questions  the  en¬ 
tire  QA  (IR  -I-  re -ranker)  system  answered 
correctly.  “IR  Loss”  represents  the  average 
number  of  questions  for  which  the  IR  failed 
completely  (i.e.,  the  IR  did  not  return  even  a  sin¬ 
gle  sentence  that  contains  the  correct  answer). 
The  IR  precision  is  the  precision  of  the  IR  sys¬ 
tem  for  the  number  of  sentences  considered.  An¬ 
swer-pinpointing  performance  is  based  on  the 
metric  described  above.  Einally,  the  overall 
score  is  the  score  of  the  entire  QA  system,  (i.e., 
precision  at  rank#l). 

The  “Overall  Precision"  column  indicates 
that  the  re-ranker  clearly  outperforms  the  classi¬ 
fier.  However,  it  is  also  very  interesting  to  com¬ 
pare  the  performance  of  the  re-ranker  “Overall 
Precision”  with  the  “Answer-Pinpointing  preci¬ 
sion”.  Eor  example,  in  the  last  row,  for  the  re¬ 
ranker  the  “Answer-Pinpointing  Precision”  is 
0.5182  whereas  the  “Overall  Precision”  is  only 
0.34.  The  difference  is  due  to  the  performance  of 
the  poor  performance  of  the  IR  system  (“IR 
Loss”  =  0.344). 


Table  3  ;  Results  for  Classifier  and  Re-ranker  under  varying  conditions  of  IR. 


IR  Sen¬ 
tences 

Total 

ques¬ 

tions 

IR  Precision 

IR  Loss 

Answer-Pinpointing 

Precision 

Number  Correct 

Qverall  Precision 

Classifier 

Re-ranker 

Classifier 

Re-ranker 

Classifier 

Re-ranker 

1 

500 

0.266 

0.742 

0.0027 

0.3565 

29 

46 

0.058 

0.092 

10 

500 

0.2018 

0.48 

0.0016 

0.4269 

7 

111 

0.014 

0.222 

50 

500 

0.1155 

0.386 

0.0015 

0.4885 

6 

150 

0.012 

0.3 

100 

500 

0.0878 

0.362 

0.0015 

0.5015 

5 

160 

0.01 

0.32 

150 

500 

0.0763 

0.35 

0.0015 

0.5138 

5 

167 

0.01 

0.334 

200 

500 

0.0703 

0.344 

0.0015 

0.5182 

3 

170 

0.01 

0.34 

IR  Sentences  =  Total  IR  sentences  considered  for  every  question 
IR  Precision  =  Precision  @  (IR  Sentences) 

IR  Loss  =  (Number  of  Questions  for  which  the  IR  did  not  produce  a  single  answer)/(Total  Questions) 
Qverall  Precision  =  (Number  Correct)/(Total  Questions) 


4.3  Oracle  IR  system 

In  order  to  determine  the  performance  of  the 
answer  pinpointing  module  alone,  we  perform 
the  so-called  oracle  IR  experiment.  Here,  we 
present  to  the  answer  pinpointing  module  only 
those  sentences  from  IR  that  contain  an  answer"^. 
The  task  of  the  answer  pinpointing  module  is  to 
pick  out  of  the  correct  answer  from  the  given 
collection.  We  report  results  in  Table  4.  In  these 
results  too  the  re -ranker  has  better  performance 
as  compared  to  the  classifier.  However,  as  we 
see  from  the  results,  there  is  a  lot  of  room  for 
improvement  for  the  re-ranker  system,  even  with 
a  perfect  IR  system. 

5  Discussion 

Our  experiments  clearly  indicate  that  the  QA 
system  viewed  as  a  re-ranker  outperforms  the 
QA  system  viewed  as  a  classifier.  The  difference 
stem  from  the  following  reasons: 

1 .  The  classification  training  criteria  work  on  a 
more  difficult  objective  function  of  trying  to 
find  whether  each  candidate  answer  answers 
the  given  question,  as  opposed  to  trying  to 
find  the  best  answer  for  the  given  question. 
Hence,  the  same  feature  set  that  works  for 
the  re-ranker  need  not  work  for  the  classi¬ 
fier.  The  feature  set  used  in  this  problem  is 
not  good  enough  to  help  the  classifier  dis¬ 
tinguish  between  correct  and  incorrect  an¬ 


swers  for  the  given  question  (even  though  it 
is  good  for  the  re-ranker  to  come  up  with  the 
best  answer). 

2.  The  comparison  of  probabilities  across  dif¬ 
ferent  events  (histories)  for  the  classifier, 
during  the  decision  rule  process,  is  problem¬ 
atic.  This  is  because  the  probabilities,  which 
we  obtain  after  the  classification  approach, 
are  only  a  poor  estimate  of  the  true  probabil¬ 
ity.  The  re-ranker,  however,  directly  allows 
these  probabilities  to  be  comparable  by  in¬ 
corporating  them  into  the  model  itself. 

3.  The  QA  system  viewed  as  a  classifier  suf¬ 
fers  from  the  problem  of  a  highly  unbal¬ 
anced  data  set.  We  have  less  than  1% 
positive  examples  and  more  than  99%  nega¬ 
tive  examples  (we  had  almost  4  million 
training  data  events)  in  the  problem.  Ittyche- 
riah  (2001),  and  Ittycheriah  and  Roukos 
(2002),  use  a  more  controlled  environment 
for  training  their  system.  They  have  23% 
positive  examples  and  77%  negative  exam¬ 
ples.  They  prune  out  most  of  the  incorrect 
answer  initially,  using  a  pre-processing  step 
by  using  either  a  rule-based  system  (Ittyche¬ 
riah,  2001)  or  a  statistical  system  (Ittyche¬ 
riah  et  ah,  2002);  and  hence  obtain  a  much 
more  manageable  distribution  in  the  training 
phase  of  the  Maximum  Entropy  model. 


This  was  performed  by  extracting  all  the  sentences  that 
were  judged  to  have  the  correct  answer  by  human  evalua¬ 
tors  during  the  TREC  2002  evaluations. 


Table  4  ;  Performance  with  a  perfect  IR  system 


Total  ques¬ 
tions 

IR  precision 

Answer-Pinpointing 

Precision 

Classifier 

Re-ranker 

429 

1.0 

0.156 

0.578 

6  Conclusion 

The  re -ranker  system  is  very  robust  in  handling 
large  amounts  of  data  and  still  produces  reason¬ 
able  results.  There  is  no  need  for  a  major  pre¬ 
processing  step  (for  eliminating  undesirable  in¬ 
correct  answers  from  the  training)  or  the  post¬ 
processing  step  (for  selecting  the  most  promis¬ 
ing  answer.) 

We  also  consider  it  significant  that  a  QA  sys¬ 
tem  with  just  4  features  (viz.  Frequency, 
Expected  Answer  Type,  Question  word  absent, 
and  ITF  word  match)  is  a  good  baseline  system 
and  performs  better  than  the  median  perform¬ 
ance  of  all  the  QA  systems  in  the  TREC  2002 
evaluations^. 

Ittycheriah  (2001),  and  Ittycheriah  and  Rou- 
kos  (2002)  have  shown  good  results  by  using  a 
range  of  features  for  Maximum  Entropy  QA  sys¬ 
tems.  Also,  the  results  indicate  that  there  is 
scope  for  research  in  IR  for  QA  systems.  The 
QA  system  has  an  upper  ceiling  on  performance 
due  to  the  quality  of  the  IR  system.  The  QA 
community  has  yet  to  address  these  problems  in 
a  principled  way,  and  the  IR  details  of  most  of 
the  system  are  hidden  behind  the  complicated 
system  architecture. 

The  re-ranking  model  basically  changes  the 
objective  function  for  training  and  the  system  is 
directly  optimized  on  the  evaluation  function 
criteria  (though  still  using  Maximum  Likelihood 
training).  Also  this  approach  seems  to  be  very 
robust  to  noisy  training  data  and  is  highly  scal¬ 
able. 
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